On words and genes

Authors

  • Anastasios A. Tsonis
  • Panagiotis A. Tsonis
Abstract

Several years ago we published two articles in Complexity, one dealing with Zipf’s law in languages [1] and the other with gene evolution [2]. Wentian Li [3, 4] sent two comments on these articles to Complexity. These comments were published in Complexity without either Li or Complexity bringing them to our attention. As a result we were unaware of their contents, and we only recently found out about their existence. Although we feel that old wounds should not be reopened, we are compelled to reply because we do not want our silence to be taken as acceptance of Li’s criticism.

First we address Li’s comments on the Zipf’s law article. Briefly, if one considers a large English text and ranks the words by frequency, so that the most frequent word has rank one, the second most frequent word has rank two, and so on, one obtains a simple relation of the form $f \propto r^{-1}$, where f is the frequency and r is the rank. This power law (a straight line with slope $-1$ in a double logarithmic plot) holds true for all languages and is known as Zipf’s law. Initially, this law was hailed as the supreme law of linguistics, and it was suggested that it manifests the principle of least effort in human behavior. According to this principle, languages evolved so that communication would be easy and efficient. As is evident from Li’s comment and from some of his earlier work on this subject [5], this interpretation of Zipf’s law is challenged by studies of random texts. However, although Li argues that Zipf’s law can be derived from random texts, he avoids addressing the problems associated with such an approach.

A random text is produced by randomly selecting from an alphabet of M letters and a space character. If the probability of selecting the space character is p ($0 < p < 1$) and all letters are equally probable, then the probability of each letter is $(1 - p)/M$. Any combination of letters between two spaces forms a word.
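As a rough illustration of the random-text construction just described, the following Python sketch draws characters independently and tabulates word frequencies by rank. The alphabet "abcd" (so M = 4), the value p = 0.2, the text length, the seed, and the function names are all illustrative assumptions; none of them comes from the article or from Li’s simulations.

```python
import random
from collections import Counter

def random_text(n_chars, letters="abcd", p_space=0.2, seed=0):
    """Draw n_chars symbols: a space with probability p_space,
    otherwise one of the letters chosen uniformly at random."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            out.append(" ")
        else:
            out.append(rng.choice(letters))
    return "".join(out)

def rank_frequency(text):
    """Return (rank, word, relative frequency) tuples, most frequent first.
    A word is any run of letters delimited by spaces."""
    counts = Counter(text.split())
    total = sum(counts.values())
    ranked = counts.most_common()
    return [(rank, word, count / total)
            for rank, (word, count) in enumerate(ranked, start=1)]

# Hypothetical parameters: M = 4 letters, p = 0.2, 200,000 characters.
table = rank_frequency(random_text(200_000))
for rank, word, freq in table[:8]:
    print(rank, word, round(freq, 4))
```

Plotting the logarithm of frequency against the logarithm of rank from `table` is one way to inspect how closely such a text follows a straight line.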
Thus, all M one-letter words occur with a probability of $c p^2 (1 - p)/M$ (probability of space character times probability of a letter times probability of space character times an appropriate normalization factor, c). All $M^2$ two-letter words occur with a probability of $c p^2 [(1 - p)/M]^2$, and so on. It follows that words will simply be ranked by length and, in general, all $M^m$ words of length m will occur with probability $c p^2 [(1 - p)/M]^m$. This implies that the function relating the frequency of words to rank, f(r), is a steplike function structured in plateaus, with the first plateau containing the M one-letter words, the second plateau containing the $M^2$ two-letter words, etc. The only way to derive Zipf’s law from these purely probabilistic arguments is to assume that rank is equal to $M^m$. This is a very weak assumption because it considers only one point (the last one) in each plateau. The plateaus may be eliminated by using unequal probabilities for the letters, but (1) this amounts to fiddling with the simulations until we get what we want, (2) the exponent of the alleged power law is then not close to one but somewhere around 1.5 (see Figure 2 in Li [5]), and (3) no matter what the assumed probabilities for the letters are, any combination of letters still constitutes a word, which is not true for real languages. Thus, although we have learned some things from studies of random texts, random texts are not convincing in dismissing the principle of least effort.

Now consider this argument. Take a text and shuffle the words. This produces a different text. Because both texts consist of the same words, they will both yield the same Zipf’s law. The shuffled text, however, is an incomprehensible sequence of words that certainly does not transmit any information, not to mention transmitting it efficiently. Obviously, Zipf’s law is trivial! This is a good argument and it has caused a stir. But is it enough to dismiss the least effort suggestion?
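The two arguments above can be checked numerically with a small Python sketch. It assumes that each specific word of length m in the equal-probability random text has probability c p^2 [(1 - p)/M]^m, as read from the verbal description (space, then letters, then space, times c). The function names, the default c = 1, and the toy sentence are hypothetical choices made only for illustration.

```python
import random
from collections import Counter

def word_probability(m, M, p, c=1.0):
    """Probability of one specific word of length m, assuming
    c * p^2 * ((1 - p) / M)^m; c = 1 is a placeholder normalization."""
    return c * p**2 * ((1 - p) / M) ** m

def plateau_ranks(m, M):
    """First and last rank occupied by the M^m equally likely words
    of length m when words are ranked by length."""
    first = sum(M**k for k in range(1, m)) + 1
    last = sum(M**k for k in range(1, m + 1))
    return first, last

def sorted_counts(words):
    """Word counts in decreasing order, i.e. the rank-frequency profile."""
    return sorted(Counter(words).values(), reverse=True)

# Plateaus for a hypothetical 26-letter alphabet with p = 0.2.
for m in (1, 2, 3):
    print(m, plateau_ranks(m, 26), word_probability(m, 26, 0.2))

# Shuffling the words changes the text but not the rank-frequency profile.
words = "the cat saw the dog and the dog saw the cat".split()
shuffled = words[:]
random.Random(1).shuffle(shuffled)
print(sorted_counts(words) == sorted_counts(shuffled))
```

Within each plateau the frequency is constant while the rank runs over all M^m words of that length, which is why a power law fitted through only the last point of each plateau rests on very few points.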
We argue that, even in this case, one cannot without reservation dismiss the fact that there is a fundamental aspect of languages that makes them amenable to formal analysis: linguistic structure consists of small units that are grouped together according to certain rules. In our article we also speculated that the (downward) deviation from Zipf’s law observed for high-ranked words (the, of, a, to, and) is not accidental but may be an important aspect of the evolution of some languages. More specifically, we proposed that these words act as “connector” words and do not obey the law. Li [3] takes issue with this, calling it too farfetched an assumption. He argues that (1) this deviation is not observed in all languages and (2) words like “a” and “the” can be regarded as synonyms, so that if they are grouped together their combined frequency brings the points close to the $f \propto r^{-1}$ line. With respect to the first argument, if Li had read our article carefully, he would have noticed that we state that such connector words may not exist in all languages. With respect to the second argument, let us consider the sentence:




Journal:
  • Complexity

Volume 8

Publication date: 2003